AITopics | query column

Collaborating Authors

query column

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns

Guo, Yuxiang, Mao, Yuren, Hu, Zhonghao, Chen, Lu, Gao, Yunjun

arXiv.org Artificial IntelligenceFeb-23-2025

Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer poor effectiveness due to the issues occurring in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection.To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency--being at least 5 orders of magnitude faster than cell-level solutions, and 3.5x faster than existing column-level methods.

joinability, proxy column, snoopy, (14 more...)

arXiv.org Artificial Intelligence

2502.16813

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.04)
Asia > China > Zhejiang Province > Ningbo (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Dong, Yuyang, Xiao, Chuan, Nozawa, Takuma, Enomoto, Masafumi, Oyamada, Masafumi

arXiv.org Artificial IntelligenceJun-23-2023

Due to the usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables for creating a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of query column and target table repository or approximate solutions lacking precision. In this paper, we propose Deepjoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents to a text sequence. The PLM reads the sequence and is fine-tuned to embed columns to vectors such that columns are expected to be joinable if they are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure becomes independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise the techniques for preparing training data as well as data augmentation. The experiments on real datasets demonstrate that by training on a small subset of a corpus, Deepjoin generalizes to large datasets and its precision consistently outperforms other approximate solutions'. Deepjoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.

deepjoin, information retrieval, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2212.07588

Country:

North America > United States > California (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Oceania > Australia (0.04)
(7 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Song, Jie, He, Yeye

arXiv.org Artificial IntelligenceApr-13-2021

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed, and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate \emph{machine-generated data} by inferring suitable data-validation "patterns" that accurately describe the underlying data domain, which minimizes false positives while maximizing data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.

digit, proceedings, query column, (14 more...)

arXiv.org Artificial Intelligence

2104.04659

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Michigan (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology (0.48)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Dong, Yuyang, Takeoka, Kunihiro, Xiao, Chuan, Oyamada, Masafumi

arXiv.org Artificial IntelligenceMar-29-2021

Finding joinable tables in data lakes is key procedure in many applications such as data integration, data augmentation, data analysis, and data market. Traditional approaches that find equi-joinable tables are unable to deal with misspellings and different formats, nor do they capture any semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on high-dimensional vectors, hence to address the limitations of equi-join approaches and identify more meaningful results. To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case when the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2010.13273

Country:

North America > United States > Alaska (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback